Multilingual translation for zero-shot biomedical classification using BioTranslator

Existing annotation paradigms rely on controlled vocabularies, where each data instance is classified into one term from a predefined set of controlled vocabularies. This paradigm restricts the analysis to concepts that are known and well-characterized. Here, we present the novel multilingual translation method BioTranslator to address this problem. BioTranslator takes a user-written textual description of a new concept and then translates this description to a non-text biological data instance. The key idea of BioTranslator is to develop a multilingual translation framework, where multiple modalities of biological data are all translated to text. We demonstrate how BioTranslator enables the identification of novel cell types using only a textual description and how BioTranslator can be further generalized to protein function prediction and drug target identification. Our tool frees scientists from limiting their analyses within predefined controlled vocabularies, enabling them to interact with biological data using free text.


REVIEWER COMMENTS
Reviewer #1 (Remarks to the Author): This paper proposes BioTranslator, an extension of ProTranslator, for cross-modal translation that uses a textual description to identify biological data instances. Experiments show that the tool outperforms ProTranslator on new cell type discovery and pathway analysis.
In the beginning, I found the motivation of this paper interesting. However, the enthusiasm quickly faded when I discovered that the paper is a small extension of a different paper by the same authors, "ProTranslator: zero-shot protein function prediction using textual description" (published at RECOMB in May 2022).
The key idea of the two papers (translating the text description to non-text biological data to enable zero-shot classification) is the same. The technological solutions are similar. At the same time, I appreciate that the authors demonstrated the generalizability of the model by fine-tuning it on larger datasets and evaluating it in more experimental settings.
I really like the ProTranslator work, but I feel that the findings in this paper, compared to ProTranslator, are incremental and do not further advance the field, and thus do not meet the standards required for publication in NC. I would suggest transferring this manuscript to Scientific Reports.
Reviewer #2 (Remarks to the Author): This paper introduces BioTranslator, an interesting work that performs cross-modal zero-shot learning on ontological terms and biological entities (genes, cells, pathways) to predict novel ontological terms and molecules. The authors conducted extensive experiments to show how the proposed method generalizes compared with other recent ones and its potential for biological discovery. Overall, the proposed method is interesting and shows its value for biomedical data analysis and discovery, especially for data from new classes, whereas most previous solutions mainly rely on well-defined labels. The main idea is borrowed from the heavily studied field of zero-shot learning, which maps data from different domains into an intermediate embedding space and aligns them to induce a zero-shot classifier. Although the experiments show promising results relative to other typical solutions, I still have several comments for the authors:
1. The idea of embedding text and non-text data (image/video) into an intermediate space for zero-shot learning is well studied. Does the performance improvement come mainly from PubMedBERT or from the various ontologies?
2. How does the proposed method compare with other zero-shot learning solutions that use structured labels for similar problems?
3. By the way, some protein function prediction solutions can perform both few-shot and zero-shot function prediction by referring to the structure of GO: even when the annotations of a particular GO term are not available, they can still associate related genes with this GO term.
4. How does the proposed method handle 5 or more new classes at the same time? Currently the results are mainly obtained in a leave-one-out fashion, which is too optimistic to give a reliable picture.
5. How the proposed method borrows information from multiple ontologies or achieves knowledge transfer between text and biological entities could be further clarified. Figure 1 gives the main idea, but what about the mathematical formulations?
Overall, the paper is interesting and can add value to general biological data mining and discovery.